A string is a sequence of characters.
Don’t confuse a string (many characters, one object) with a character vector (vector of strings).
stringr
Common tasks
Find which strings contain a particular pattern
Remove or replace a pattern
Edit a string (for example, make it lowercase)
Note
The package stringr is very useful for strings!
stringr loads with the tidyverse.
All of its functions are named str_xxx().
pattern =
The pattern argument in all of the stringr functions …
Note
Discuss with a neighbor. For each of these functions, give:
str_detect()
Returns a logical vector (TRUE/FALSE) indicating whether the pattern was found in each element of the original vector.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_detect(my_vector, pattern = "Bond")
[1] FALSE FALSE TRUE TRUE
Often used with filter(), or with summarise() and sum() or mean().
Related functions
str_subset() returns just the strings that contain the match
str_which() returns the indexes of strings that have a match
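A small sketch of how these fit together, reusing my_vector from above. The object names (has_bond, n_bond, and so on) are made up for illustration.

```r
library(stringr)

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")

# One TRUE/FALSE per element of the vector
has_bond <- str_detect(my_vector, pattern = "Bond")

# TRUE counts as 1, so sum() counts the matches and mean() gives the proportion
n_bond <- sum(has_bond)
prop_bond <- mean(has_bond)

# Related functions return the matching strings or their positions
bond_strings <- str_subset(my_vector, pattern = "Bond")
bond_index <- str_which(my_vector, pattern = "Bond")
```

The sum() and mean() trick is why str_detect() pairs so naturally with summarise().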
str_match()
Returns a character matrix with either NA or the pattern, depending on whether the pattern was found.
str_extract()
Returns a character vector with either NA or the pattern, depending on whether the pattern was found.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_extract(my_vector, pattern = "Bond")
[1] NA NA "Bond" "Bond"
Warning
str_extract() only returns the first pattern match.
Use str_extract_all() to return every pattern match.
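To see the difference, here is a quick sketch with a made-up vector (intro is not from the slides) in which one string contains the pattern twice.

```r
library(stringr)

# Hypothetical example vector: the first string contains "Bond" twice
intro <- c("Bond, James Bond", "Hello")

# Only the first match per string; NA when there is no match
first_only <- str_extract(intro, pattern = "Bond")

# Every match per string, returned as a list with one element per string
every_match <- str_extract_all(intro, pattern = "Bond")
```

Note that str_extract_all() returns a list, because each string can have a different number of matches.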
str_locate()
Returns an integer matrix with two columns, start and end, giving either NA or the start and end position of the first match of the pattern.
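A minimal sketch of what str_locate() gives back for the same my_vector used in the other examples (loc is just an illustrative name).

```r
library(stringr)

my_vector <- c("Hello,", "my name is", "Bond", "James Bond")

# One row per string, with columns "start" and "end"
loc <- str_locate(my_vector, pattern = "Bond")

# Rows without a match are NA; "James Bond" has "Bond" at positions 7 to 10
loc[4, ]
```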
str_subset()
Returns the subset of the original character vector containing only the elements where the pattern occurs.
my_vector <- c("Hello,", "my name is", "Bond", "James Bond")
str_subset(my_vector, pattern = "Bond")
[1] "Bond" "James Bond"
Related Functions
str_sub() extracts values based on location (starting and ending position).
str_replace()
Replaces the first matched pattern.
Often used inside mutate().
Related functions
str_replace_all() replaces all matched patterns
str_remove_all() removes all matched patterns
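A short sketch comparing str_replace() with its related functions. The string and the replacement text are made up for illustration.

```r
library(stringr)

# Hypothetical string containing the pattern twice
intro <- "Bond, James Bond"

# Only the first "Bond" is replaced
first_swapped <- str_replace(intro, pattern = "Bond", replacement = "Blonde")

# Every "Bond" is replaced
all_swapped <- str_replace_all(intro, pattern = "Bond", replacement = "Blonde")

# Every "Bond" is removed (the same as replacing with "")
all_removed <- str_remove_all(intro, pattern = "Bond")
```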
Convert letters in the string to a specific capitalization format.
str_to_lower() converts all letters in the strings to lowercase
str_to_upper() converts all letters in the strings to uppercase
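For example, with stringr's str_to_lower() and str_to_upper() (the input string here is just an illustration):

```r
library(stringr)

agent <- "James Bond"

lower <- str_to_lower(agent)  # every letter becomes lowercase
upper <- str_to_upper(agent)  # every letter becomes uppercase
```

Converting to one case first is a handy way to make later pattern matching case-insensitive.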
str_c()
Joins multiple strings into a single string.
The sep argument declares how the strings should be separated when pasting.
prompt <- "Hello, my name is"
first <- "James"
last <- "Bond"
str_c(prompt, last, ",", first, last, sep = " ")
[1] "Hello, my name is Bond , James Bond"
Note
Similar to paste() and paste0()
str_flatten()
Combines a character vector into a single string.
[1] "Hello, my name is Bond James Bond"
Note
str_c() with the collapse argument will do the same thing, but using str_flatten() is encouraged instead.
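A sketch of the comparison, assuming the strings are stored in a single character vector (pieces is a made-up name):

```r
library(stringr)

# Hypothetical character vector holding the pieces of the sentence
pieces <- c("Hello, my name is", "Bond", "James", "Bond")

# Collapse the whole vector into one string
flattened <- str_flatten(pieces, collapse = " ")

# str_c() gives the same result when collapse is supplied
collapsed <- str_c(pieces, collapse = " ")
```

Both calls return a single string, whereas str_c(pieces) without collapse would return the four strings unchanged.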
str_glue()
Uses the environment to create a string and evaluates {expressions}.
My name is Bond, James Bond
Tip
See the R package glue! We will use these tools a lot in Week 7!
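A minimal sketch with stringr's str_glue(), which looks up first and last in the current environment and evaluates anything inside the braces:

```r
library(stringr)

first <- "James"
last <- "Bond"

# Variables inside {} are looked up in the environment
greeting <- str_glue("My name is {last}, {first} {last}")

# Any R expression can go inside {}, not just a variable name
shout <- str_glue("The name is {str_to_upper(last)}")
```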
Refer to the stringr cheatsheet
Remember that str_xxx functions need the first argument to be a vector of strings, not a data set.
Use them inside filter() or mutate().
name is_bran manuf type calories protein
1 100% Bran TRUE N cold 70 4
2 100% Natural Bran TRUE Q cold 120 3
3 All-Bran TRUE K cold 70 4
4 All-Bran with Extra Fiber TRUE K cold 50 4
5 Almond Delight FALSE R cold 110 2
6 Apple Cinnamon Cheerios FALSE G cold 110 2
7 Apple Jacks FALSE K cold 110 2
8 Basic 4 FALSE G cold 130 3
9 Bran Chex TRUE R cold 90 2
10 Bran Flakes TRUE P cold 90 3
11 Cap'n'Crunch FALSE Q cold 120 1
12 Cheerios FALSE G cold 110 6
13 Cinnamon Toast Crunch FALSE G cold 120 1
14 Clusters FALSE G cold 110 3
15 Cocoa Puffs FALSE G cold 110 1
16 Corn Chex FALSE R cold 110 2
17 Corn Flakes FALSE K cold 100 2
18 Corn Pops FALSE K cold 110 1
19 Count Chocula FALSE G cold 110 1
20 Cracklin' Oat Bran TRUE K cold 110 3
21 Cream of Wheat (Quick) FALSE N hot 100 3
22 Crispix FALSE K cold 110 2
23 Crispy Wheat & Raisins FALSE G cold 100 2
24 Double Chex FALSE R cold 100 2
25 Froot Loops FALSE K cold 110 2
26 Frosted Flakes FALSE K cold 110 1
27 Frosted Mini-Wheats FALSE K cold 100 3
28 Fruit & Fibre Dates; Walnuts; and Oats FALSE P cold 120 3
29 Fruitful Bran TRUE K cold 120 3
30 Fruity Pebbles FALSE P cold 110 1
31 Golden Crisp FALSE P cold 100 2
32 Golden Grahams FALSE G cold 110 1
33 Grape Nuts Flakes FALSE P cold 100 3
34 Grape-Nuts FALSE P cold 110 3
35 Great Grains Pecan FALSE P cold 120 3
36 Honey Graham Ohs FALSE Q cold 120 1
37 Honey Nut Cheerios FALSE G cold 110 3
38 Honey-comb FALSE P cold 110 1
39 Just Right Crunchy Nuggets FALSE K cold 110 2
40 Just Right Fruit & Nut FALSE K cold 140 3
41 Kix FALSE G cold 110 2
42 Life FALSE Q cold 100 4
43 Lucky Charms FALSE G cold 110 2
44 Maypo FALSE A hot 100 4
45 Muesli Raisins; Dates; & Almonds FALSE R cold 150 4
46 Muesli Raisins; Peaches; & Pecans FALSE R cold 150 4
47 Mueslix Crispy Blend FALSE K cold 160 3
48 Multi-Grain Cheerios FALSE G cold 100 2
49 Nut&Honey Crunch FALSE K cold 120 2
50 Nutri-Grain Almond-Raisin FALSE K cold 140 3
51 Nutri-grain Wheat FALSE K cold 90 3
52 Oatmeal Raisin Crisp FALSE G cold 130 3
53 Post Nat. Raisin Bran TRUE P cold 120 3
54 Product 19 FALSE K cold 100 3
55 Puffed Rice FALSE Q cold 50 1
56 Puffed Wheat FALSE Q cold 50 2
57 Quaker Oat Squares FALSE Q cold 100 4
58 Quaker Oatmeal FALSE Q hot 100 5
59 Raisin Bran TRUE K cold 120 3
60 Raisin Nut Bran TRUE G cold 100 3
61 Raisin Squares FALSE K cold 90 2
62 Rice Chex FALSE R cold 110 1
63 Rice Krispies FALSE K cold 110 2
64 Shredded Wheat FALSE N cold 80 2
65 Shredded Wheat 'n'Bran TRUE N cold 90 3
66 Shredded Wheat spoon size FALSE N cold 90 3
67 Smacks FALSE K cold 110 2
68 Special K FALSE K cold 110 6
69 Strawberry Fruit Wheats FALSE N cold 90 2
70 Total Corn Flakes FALSE G cold 110 2
71 Total Raisin Bran TRUE G cold 140 3
72 Total Whole Grain FALSE G cold 100 3
73 Triples FALSE G cold 110 2
74 Trix FALSE G cold 110 1
75 Wheat Chex FALSE R cold 100 3
76 Wheaties FALSE G cold 100 3
77 Wheaties Honey Gold FALSE G cold 110 2
fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
1 1 130 10.0 5.0 6 280 25 3 1.00 0.33 68.40297
2 5 15 2.0 8.0 8 135 0 3 1.00 1.00 33.98368
3 1 260 9.0 7.0 5 320 25 3 1.00 0.33 59.42551
4 0 140 14.0 8.0 0 330 25 3 1.00 0.50 93.70491
5 2 200 1.0 14.0 8 -1 25 3 1.00 0.75 34.38484
6 2 180 1.5 10.5 10 70 25 1 1.00 0.75 29.50954
7 0 125 1.0 11.0 14 30 25 2 1.00 1.00 33.17409
8 2 210 2.0 18.0 8 100 25 3 1.33 0.75 37.03856
9 1 200 4.0 15.0 6 125 25 1 1.00 0.67 49.12025
10 0 210 5.0 13.0 5 190 25 3 1.00 0.67 53.31381
11 2 220 0.0 12.0 12 35 25 2 1.00 0.75 18.04285
12 2 290 2.0 17.0 1 105 25 1 1.00 1.25 50.76500
13 3 210 0.0 13.0 9 45 25 2 1.00 0.75 19.82357
14 2 140 2.0 13.0 7 105 25 3 1.00 0.50 40.40021
15 1 180 0.0 12.0 13 55 25 2 1.00 1.00 22.73645
16 0 280 0.0 22.0 3 25 25 1 1.00 1.00 41.44502
17 0 290 1.0 21.0 2 35 25 1 1.00 1.00 45.86332
18 0 90 1.0 13.0 12 20 25 2 1.00 1.00 35.78279
19 1 180 0.0 12.0 13 65 25 2 1.00 1.00 22.39651
20 3 140 4.0 10.0 7 160 25 3 1.00 0.50 40.44877
21 0 80 1.0 21.0 0 -1 0 2 1.00 1.00 64.53382
22 0 220 1.0 21.0 3 30 25 3 1.00 1.00 46.89564
23 1 140 2.0 11.0 10 120 25 3 1.00 0.75 36.17620
24 0 190 1.0 18.0 5 80 25 3 1.00 0.75 44.33086
25 1 125 1.0 11.0 13 30 25 2 1.00 1.00 32.20758
26 0 200 1.0 14.0 11 25 25 1 1.00 0.75 31.43597
27 0 0 3.0 14.0 7 100 25 2 1.00 0.80 58.34514
28 2 160 5.0 12.0 10 200 25 3 1.25 0.67 40.91705
29 0 240 5.0 14.0 12 190 25 3 1.33 0.67 41.01549
30 1 135 0.0 13.0 12 25 25 2 1.00 0.75 28.02576
31 0 45 0.0 11.0 15 40 25 1 1.00 0.88 35.25244
32 1 280 0.0 15.0 9 45 25 2 1.00 0.75 23.80404
33 1 140 3.0 15.0 5 85 25 3 1.00 0.88 52.07690
34 0 170 3.0 17.0 3 90 25 3 1.00 0.25 53.37101
35 3 75 3.0 13.0 4 100 25 3 1.00 0.33 45.81172
36 2 220 1.0 12.0 11 45 25 2 1.00 1.00 21.87129
37 1 250 1.5 11.5 10 90 25 1 1.00 0.75 31.07222
38 0 180 0.0 14.0 11 35 25 1 1.00 1.33 28.74241
39 1 170 1.0 17.0 6 60 100 3 1.00 1.00 36.52368
40 1 170 2.0 20.0 9 95 100 3 1.30 0.75 36.47151
41 1 260 0.0 21.0 3 40 25 2 1.00 1.50 39.24111
42 2 150 2.0 12.0 6 95 25 2 1.00 0.67 45.32807
43 1 180 0.0 12.0 12 55 25 2 1.00 1.00 26.73451
44 1 0 0.0 16.0 3 95 25 2 1.00 1.00 54.85092
45 3 95 3.0 16.0 11 170 25 3 1.00 1.00 37.13686
46 3 150 3.0 16.0 11 170 25 3 1.00 1.00 34.13976
47 2 150 3.0 17.0 13 160 25 3 1.50 0.67 30.31335
48 1 220 2.0 15.0 6 90 25 1 1.00 1.00 40.10596
49 1 190 0.0 15.0 9 40 25 2 1.00 0.67 29.92429
50 2 220 3.0 21.0 7 130 25 3 1.33 0.67 40.69232
51 0 170 3.0 18.0 2 90 25 3 1.00 1.00 59.64284
52 2 170 1.5 13.5 10 120 25 3 1.25 0.50 30.45084
53 1 200 6.0 11.0 14 260 25 3 1.33 0.67 37.84059
54 0 320 1.0 20.0 3 45 100 3 1.00 1.00 41.50354
55 0 0 0.0 13.0 0 15 0 3 0.50 1.00 60.75611
56 0 0 1.0 10.0 0 50 0 3 0.50 1.00 63.00565
57 1 135 2.0 14.0 6 110 25 3 1.00 0.50 49.51187
58 2 0 2.7 -1.0 -1 110 0 1 1.00 0.67 50.82839
59 1 210 5.0 14.0 12 240 25 2 1.33 0.75 39.25920
60 2 140 2.5 10.5 8 140 25 3 1.00 0.50 39.70340
61 0 0 2.0 15.0 6 110 25 3 1.00 0.50 55.33314
62 0 240 0.0 23.0 2 30 25 1 1.00 1.13 41.99893
63 0 290 0.0 22.0 3 35 25 1 1.00 1.00 40.56016
64 0 0 3.0 16.0 0 95 0 1 0.83 1.00 68.23588
65 0 0 4.0 19.0 0 140 0 1 1.00 0.67 74.47295
66 0 0 3.0 20.0 0 120 0 1 1.00 0.67 72.80179
67 1 70 1.0 9.0 15 40 25 2 1.00 0.75 31.23005
68 0 230 1.0 16.0 3 55 25 1 1.00 1.00 53.13132
69 0 15 3.0 15.0 5 90 25 2 1.00 1.00 59.36399
70 1 200 0.0 21.0 3 35 100 3 1.00 1.00 38.83975
71 1 190 4.0 15.0 14 230 100 3 1.50 1.00 28.59278
72 1 200 3.0 16.0 3 110 100 3 1.00 1.00 46.65884
73 1 250 0.0 21.0 3 60 25 3 1.00 0.75 39.10617
74 1 140 0.0 13.0 12 25 25 2 1.00 1.00 27.75330
75 1 230 3.0 17.0 3 115 25 1 1.00 0.67 49.78744
76 1 200 3.0 17.0 3 110 25 1 1.00 1.00 51.59219
77 1 200 1.0 16.0 8 60 25 1 1.00 0.75 36.18756
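Output like the table above can be produced by putting a stringr function inside a dplyr verb. The sketch below uses a tiny made-up stand-in (cereal_small) and is not necessarily the code behind the slide.

```r
library(dplyr)
library(stringr)

# Hypothetical three-row stand-in for the cereal data
cereal_small <- tibble(
  name = c("All-Bran", "Cheerios", "Raisin Bran"),
  calories = c(70, 110, 120)
)

# str_detect() receives the name column (a character vector), not the data set
with_flag <- cereal_small |>
  mutate(is_bran = str_detect(name, pattern = "Bran"))

# The same call works as a row filter
bran_only <- cereal_small |>
  filter(str_detect(name, pattern = "Bran"))
```

Calling str_detect(cereal_small, "Bran") directly would not do what you want, because the first argument would be a data set rather than a vector of strings.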
“Regexps are a very terse language that allow you to describe patterns in strings.”
R for Data Science
R uses “extended” regular expressions, which are common.
Tip
Regular expressions are a reason to use stringr!
You might encounter gsub(), grep(), etc. from Base R.
. ^ $ \ | * + ? { } [ ] ( )
[1] "She" "sells" "seashells" "by" "the" "seashore!"
. Represents any character
[1] "She" "sells" "seashells" "by" "the" "seashore!"
^ Looks at the beginning
$ Looks at the end
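A sketch of these three metacharacters on the vector printed above. The object name tongue_twister is an assumption, since the slide only shows the output.

```r
library(stringr)

# Assumed name for the vector shown above
tongue_twister <- c("She", "sells", "seashells", "by", "the", "seashore!")

# ^ anchors the match to the beginning: strings that start with a lowercase s
starts_s <- str_subset(tongue_twister, pattern = "^s")

# $ anchors the match to the end: strings that end with s
ends_s <- str_subset(tongue_twister, pattern = "s$")

# . stands for any single character: any character followed by h, at the start
second_h <- str_subset(tongue_twister, pattern = "^.h")
```

"She" is missing from starts_s because matching is case sensitive, and "seashore!" is missing from ends_s because it ends with "!".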
[1] "shes" "shels" "shells" "shellls" "shelllls"
? Occurs 0 or 1 times
+ Occurs 1 or more times
* Occurs 0 or more times
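A sketch of the three quantifiers applied to the letter l, using the vector printed above (shells is an assumed object name):

```r
library(stringr)

# Assumed name for the vector shown above
shells <- c("shes", "shels", "shells", "shellls", "shelllls")

# l? : zero or one l before the final s
zero_or_one <- str_subset(shells, pattern = "shel?s")

# l+ : one or more l
one_or_more <- str_subset(shells, pattern = "shel+s")

# l* : zero or more l
zero_or_more <- str_subset(shells, pattern = "shel*s")
```

Each quantifier applies only to the character (or group) immediately before it, here the l.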
[1] "shes" "shels" "shells" "shellls" "shelllls"
{n} matches exactly n times.
{n,} matches at least n times.
{n,m} matches between n and m times.
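The same vector works for trying out the curly-brace quantifiers (again, shells is an assumed object name):

```r
library(stringr)

# Assumed name for the vector shown above
shells <- c("shes", "shels", "shells", "shellls", "shelllls")

# Exactly two l's
exactly_two <- str_subset(shells, pattern = "shel{2}s")

# At least two l's
at_least_two <- str_subset(shells, pattern = "shel{2,}s")

# Between one and three l's
one_to_three <- str_subset(shells, pattern = "shel{1,3}s")
```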
()
Groups can be created with ( )
| – “either” / “or”
toung_twister2 <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers!")
toung_twister2
[1] "Peter" "Piper" "picked" "a" "peck" "of" "pickled"
[8] "peppers!"
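A sketch of grouping and alternation on this vector. The patterns are illustrative choices, not taken from the slides.

```r
library(stringr)

toung_twister2 <- c("Peter", "Piper", "picked", "a", "peck", "of", "pickled", "peppers!")

# p, then either e or i, then ck
pick_or_peck <- str_subset(toung_twister2, pattern = "p(e|i)ck")

# Starts with either P or p, followed by e
starts_pe <- str_subset(toung_twister2, pattern = "^(P|p)e")
```

The parentheses limit what the | applies to. Without them, "^P|pe" would mean "starts with P" or "contains pe anywhere".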
[]
[] Character Classes
\w Looks for any “word” character (conversely, \W looks for any “not word” character)
\d Looks for any digit (conversely, \D looks for any non-digit)
\s Looks for any whitespace (conversely, \S looks for any non-whitespace)
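A sketch of character classes in action on a made-up vector (agents is hypothetical). Inside an R string the backslash itself must be doubled, so \d is written "\\d".

```r
library(stringr)

# Hypothetical example vector
agents <- c("Agent 007", "James Bond", "M")

# \d : does the string contain a digit?
has_digit <- str_detect(agents, pattern = "\\d")

# \d+ : extract the first run of one or more digits
digits <- str_extract(agents, pattern = "\\d+")

# \s : does the string contain whitespace?
has_space <- str_detect(agents, pattern = "\\s")

# [JM] : a custom class matching either J or M, here at the start
starts_j_or_m <- str_subset(agents, pattern = "^[JM]")

# \W : replace every non-word character (here, the space) with an underscore
no_spaces <- str_replace_all(agents[1], pattern = "\\W", replacement = "_")
```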
Discuss with a neighbor which regular expressions would search for words that do the following:
Test your answers out on
\
In order to match a special character, you need to “escape” it first.
Warning
In general, look at punctuation characters with suspicion.
[1] "How" "much" "wood" "could" "a" "woodchuck"
[7] "chuck" "if" "a" "woodchuck" "could" "chuck"
[13] "wood?"
Error in stri_subset_regex(string, pattern, omit_na = TRUE, negate = negate, : Syntax error in regex pattern. (U_REGEX_RULE_SYNTAX, context=`?`)
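An error like the one above is what you get when a bare ? is used as the pattern, since ? is a quantifier with nothing to quantify. Here is a sketch of three ways to match a literal question mark instead (woodchuck is an assumed object name):

```r
library(stringr)

# Assumed name for the vector shown above
woodchuck <- c("How", "much", "wood", "could", "a", "woodchuck",
               "chuck", "if", "a", "woodchuck", "could", "chuck", "wood?")

# Escape the ? with a backslash; in an R string the backslash is doubled
escaped <- str_subset(woodchuck, pattern = "\\?")

# Inside [ ] the ? loses its special meaning
in_class <- str_subset(woodchuck, pattern = "[?]")

# fixed() tells stringr to treat the pattern as literal text, not a regex
literal <- str_subset(woodchuck, pattern = fixed("?"))
```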
Read the regular expressions out loud like a “request”.
Test out your expressions on small examples first, using str_view() and str_view_all().
I use the stringr cheatsheet more than any other package cheatsheet!
tidyverse
matches(pattern)
Selects all variables with a name that matches the supplied pattern.
Use it inside select(), rename_with(), and across().
I received this data from a grad school colleague the other day who asked if I knew how to “clean” it.
What is that column?! 😮
[{'variant': 'Other', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 4.59}, {'variant': 'V-20DEC-01 (Alpha)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-21APR-02 (Delta B.1.617.2)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-21OCT-01 (Delta AY 4.2)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}, {'variant': 'V-22DEC-01 (Omicron CH.1.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 24.56}, {'variant': 'V-22JUL-01 (Omicron BA.2.75)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 8.93}, {'variant': 'V-22OCT-01 (Omicron BQ.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 49.57}, {'variant': 'VOC-21NOV-01 (Omicron BA.1)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 0.02}, {'variant': 'VOC-22APR-03 (Omicron BA.4)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 0.08}, {'variant': 'VOC-22APR-04 (Omicron BA.5)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 5.59}, {'variant': 'VOC-22JAN-01 (Omicron BA.2)', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 1.41}, {'variant': 'unclassified_variant', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 5.26}]
stringr! 🎉
Let’s see how this works.
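One possible approach, sketched on a shortened copy of the column value above (only the first two entries, stored under the made-up name messy). This is an illustration, not necessarily the cleaning steps used in class.

```r
library(stringr)

# Shortened copy of the messy value: the first two entries only
messy <- "[{'variant': 'Other', 'cumWeeklySequenced': 2366843.0, 'newWeeklyPercentage': 4.59}, {'variant': 'V-20DEC-01 (Alpha)', 'cumWeeklySequenced': 0.0, 'newWeeklyPercentage': 0.0}]"

# The ( ) group captures the text between the quotes after 'variant':
# str_match_all() returns a list of matrices; column 2 holds the first group
variants <- str_match_all(messy, pattern = "'variant': '([^']+)'")[[1]][, 2]

# Same idea for the percentage: capture a run of digits and periods
percentages <- str_match_all(messy, pattern = "'newWeeklyPercentage': ([0-9.]+)")[[1]][, 2] |>
  as.numeric()
```

The class [^'] means "any character except a single quote", so the capture stops at the closing quote of each variant name.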
In this activity, you will be using regular expressions to decode a message.
Remember, the stringr functions go inside dplyr verbs like mutate() and filter(). Think of them as you would as.factor().
[1] "How" "much" "wood" "could" "a" "woodchuck"
[7] "chuck" "if" "a" "woodchuck" "could" "chuck"
[13] "wood?"